Pandas
pandas is a Python library for data analysis. It offers a number of data exploration, cleaning and transformation operations that are critical in working with data in Python.
pandas build upon numpy and scipy providing easy-to-use data structures and data manipulation functions with integrated indexing.
The main data structures pandas provides are Series and DataFrames. After a brief introduction to these two data structures and data ingestion, the key features of pandas this notebook covers are:
Additional Recommended Resources:
Let's get started with our first pandas notebook!
import pandas as pd
Introduction to pandas Data Structures
pandas Series
pandas Series one-dimensional labeled array.
ser = pd.Series([100, 'foo', 300, 'bar', 500], ['tom', 'bob', 'nancy', 'dan', 'eric'])
print(ser)
tom 100 bob foo nancy 300 dan bar eric 500 dtype: object
ser.index
Index(['tom', 'bob', 'nancy', 'dan', 'eric'], dtype='object')
ser.loc[['nancy','bob']]
nancy 300 bob foo dtype: object
ser[[4, 3, 1]]
eric 500 dan bar bob foo dtype: object
ser.iloc[2]
300
'bob' in ser
True
print(ser)
print(ser*2)
tom 100 bob foo nancy 300 dan bar eric 500 dtype: object tom 200 bob foofoo nancy 600 dan barbar eric 1000 dtype: object
ser[['nancy', 'eric']] ** 2
nancy 90000 eric 250000 dtype: object
pandas DataFrame
pandas DataFrame is a 2-dimensional labeled data structure.
Create DataFrame from dictionary of Python Series
d = {'one' : pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
'two' : pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill', 'dancy'])}
df = pd.DataFrame(d)
df
one | two | |
---|---|---|
apple | 100.0 | 111.0 |
ball | 200.0 | 222.0 |
cerill | NaN | 333.0 |
clock | 300.0 | NaN |
dancy | NaN | 4444.0 |
Other way to do the same
d = {'one' : [100., 200.,float("NaN"), 300., float("NaN")],'two':[111., 222., 333., float("NaN"),4444.],"tmp_index":['apple', 'ball', 'cerill', 'clock', 'dancy']}
df=pd.DataFrame(data=d)
df.set_index("tmp_index",inplace=True)
df.index.name = None
df
one | two | |
---|---|---|
apple | 100.0 | 111.0 |
ball | 200.0 | 222.0 |
cerill | NaN | 333.0 |
clock | 300.0 | NaN |
dancy | NaN | 4444.0 |
d = {'one' : pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
'two' : pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill', 'dancy'])}
df = pd.DataFrame(d)
pd.DataFrame(d, index=['dancy', 'ball', 'apple'])
one | two | |
---|---|---|
dancy | NaN | 4444.0 |
ball | 200.0 | 222.0 |
apple | 100.0 | 111.0 |
pd.DataFrame(d, index=['dancy', 'ball', 'apple'], columns=['two', 'five'])
two | five | |
---|---|---|
dancy | 4444.0 | NaN |
ball | 222.0 | NaN |
apple | 111.0 | NaN |
Create DataFrame from list of Python dictionaries
data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]
pd.DataFrame(data)
alex | alice | dora | ema | joe | |
---|---|---|---|---|---|
0 | 1.0 | NaN | NaN | NaN | 2.0 |
1 | NaN | 20.0 | 10.0 | 5.0 | NaN |
pd.DataFrame(data, index=['orange', 'red'])
alex | alice | dora | ema | joe | |
---|---|---|---|---|---|
orange | 1.0 | NaN | NaN | NaN | 2.0 |
red | NaN | 20.0 | 10.0 | 5.0 | NaN |
pd.DataFrame(data, columns=['joe', 'dora','alice'])
joe | dora | alice | |
---|---|---|---|
0 | 2.0 | NaN | NaN |
1 | NaN | 10.0 | 20.0 |
Basic DataFrame operations
df
one | two | |
---|---|---|
apple | 100.0 | 111.0 |
ball | 200.0 | 222.0 |
cerill | NaN | 333.0 |
clock | 300.0 | NaN |
dancy | NaN | 4444.0 |
df['one']
apple 100.0 ball 200.0 cerill NaN clock 300.0 dancy NaN Name: one, dtype: float64
df['three'] = df['one'] * df['two']
df
one | two | three | |
---|---|---|---|
apple | 100.0 | 111.0 | 11100.0 |
ball | 200.0 | 222.0 | 44400.0 |
cerill | NaN | 333.0 | NaN |
clock | 300.0 | NaN | NaN |
dancy | NaN | 4444.0 | NaN |
df['flag'] = df['one'] > 250
df
one | two | three | flag | |
---|---|---|---|---|
apple | 100.0 | 111.0 | 11100.0 | False |
ball | 200.0 | 222.0 | 44400.0 | False |
cerill | NaN | 333.0 | NaN | False |
clock | 300.0 | NaN | NaN | True |
dancy | NaN | 4444.0 | NaN | False |
three = df.pop('three')
three
apple 11100.0 ball 44400.0 cerill NaN clock NaN dancy NaN Name: three, dtype: float64
df
one | two | flag | |
---|---|---|---|
apple | 100.0 | 111.0 | False |
ball | 200.0 | 222.0 | False |
cerill | NaN | 333.0 | False |
clock | 300.0 | NaN | True |
dancy | NaN | 4444.0 | False |
del df['two']
df
one | flag | |
---|---|---|
apple | 100.0 | False |
ball | 200.0 | False |
cerill | NaN | False |
clock | 300.0 | True |
dancy | NaN | False |
df.insert(2, 'copy_of_one', df['one'])
df
one | flag | copy_of_one | |
---|---|---|---|
apple | 100.0 | False | 100.0 |
ball | 200.0 | False | 200.0 |
cerill | NaN | False | NaN |
clock | 300.0 | True | 300.0 |
dancy | NaN | False | NaN |
df['one_upper_half'] = df['one'][:2]
df
one | flag | copy_of_one | one_upper_half | |
---|---|---|---|---|
apple | 100.0 | False | 100.0 | 100.0 |
ball | 200.0 | False | 200.0 | 200.0 |
cerill | NaN | False | NaN | NaN |
clock | 300.0 | True | 300.0 | NaN |
dancy | NaN | False | NaN | NaN |
df.dropna(axis=0,thresh=2)
one | flag | copy_of_one | one_upper_half | |
---|---|---|---|---|
apple | 100.0 | False | 100.0 | 100.0 |
ball | 200.0 | False | 200.0 | 200.0 |
clock | 300.0 | True | 300.0 | NaN |
Case Study: Movie Data Analysis
Please note that you will need to download the dataset.
Here are the links to the data source and location:
Once the download completes, please make sure the data files are in a directory called movielens
Let us look at the files in this dataset using the UNIX command ls.
%%bash
ls movielens/Large/
README.txt genome-scores.csv genome-tags.csv links.csv movies.csv ratings.csv tags.csv
%%bash
cat movielens/Large/movies.csv | wc -l
27279
%%bash
cat movielens/Large/ratings.csv | wc -l
20000264
%%bash
head -5 ./movielens/Large/ratings.csv
userId,movieId,rating,timestamp 1,2,3.5,1112486027 1,29,3.5,1112484676 1,32,3.5,1112484819 1,47,3.5,1112484727
Use Pandas to Read the Dataset
Using the read_csv function in pandas, we will ingest these three files.
movies = pd.read_csv('./movielens/Large/movies.csv', sep=',')
print(type(movies))
movies.head(15)
<class 'pandas.core.frame.DataFrame'>
movieId | title | genres | |
---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
1 | 2 | Jumanji (1995) | Adventure|Children|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
5 | 6 | Heat (1995) | Action|Crime|Thriller |
6 | 7 | Sabrina (1995) | Comedy|Romance |
7 | 8 | Tom and Huck (1995) | Adventure|Children |
8 | 9 | Sudden Death (1995) | Action |
9 | 10 | GoldenEye (1995) | Action|Adventure|Thriller |
10 | 11 | American President, The (1995) | Comedy|Drama|Romance |
11 | 12 | Dracula: Dead and Loving It (1995) | Comedy|Horror |
12 | 13 | Balto (1995) | Adventure|Animation|Children |
13 | 14 | Nixon (1995) | Drama |
14 | 15 | Cutthroat Island (1995) | Action|Adventure|Romance |
tags = pd.read_csv('./movielens/Large/tags.csv', sep=',')
tags.head()
userId | movieId | tag | timestamp | |
---|---|---|---|---|
0 | 18 | 4141 | Mark Waters | 1240597180 |
1 | 65 | 208 | dark hero | 1368150078 |
2 | 65 | 353 | dark hero | 1368150079 |
3 | 65 | 521 | noir thriller | 1368149983 |
4 | 65 | 592 | dark hero | 1368150078 |
ratings = pd.read_csv('./movielens/Large/ratings.csv', sep=',', parse_dates=['timestamp'])
ratings.head()
userId | movieId | rating | timestamp | |
---|---|---|---|---|
0 | 1 | 2 | 3.5 | 1112486027 |
1 | 1 | 29 | 3.5 | 1112484676 |
2 | 1 | 32 | 3.5 | 1112484819 |
3 | 1 | 47 | 3.5 | 1112484727 |
4 | 1 | 50 | 3.5 | 1112484580 |
For current analysis, we will remove the Timestamp ( we could get to it later if you want)
del ratings['timestamp']
del tags['timestamp']
row_0 = tags.iloc[0]
print(type(row_0))
print(row_0)
<class 'pandas.core.series.Series'> userId 18 movieId 4141 tag Mark Waters Name: 0, dtype: object
row_0.index
Index(['userId', 'movieId', 'tag'], dtype='object')
row_0['userId']
18
'rating' in row_0
False
row_0.name
0
row_0 = row_0.rename('first_row')
row_0.name
'first_row'
Let's look how the ratings are distributed!
ratings.describe()
userId | movieId | rating | |
---|---|---|---|
count | 2.000026e+07 | 2.000026e+07 | 2.000026e+07 |
mean | 6.904587e+04 | 9.041567e+03 | 3.525529e+00 |
std | 4.003863e+04 | 1.978948e+04 | 1.051989e+00 |
min | 1.000000e+00 | 1.000000e+00 | 5.000000e-01 |
25% | 3.439500e+04 | 9.020000e+02 | 3.000000e+00 |
50% | 6.914100e+04 | 2.167000e+03 | 3.500000e+00 |
75% | 1.036370e+05 | 4.770000e+03 | 4.000000e+00 |
max | 1.384930e+05 | 1.312620e+05 | 5.000000e+00 |
ratings.mode()
userId | movieId | rating | |
---|---|---|---|
0 | 118205 | 296 | 4.0 |
ratings.corr()
userId | movieId | rating | |
---|---|---|---|
userId | 1.000000 | -0.000850 | 0.001175 |
movieId | -0.000850 | 1.000000 | 0.002606 |
rating | 0.001175 | 0.002606 | 1.000000 |
filter_2 = ratings.loc[ratings['rating'] > 0]
filter_2.groupby("movieId").mean()
userId | rating | |
---|---|---|
movieId | ||
1 | 69282.396821 | 3.921240 |
2 | 69169.928202 | 3.211977 |
3 | 69072.079388 | 3.151040 |
4 | 69652.913280 | 2.861393 |
5 | 69113.475454 | 3.064592 |
6 | 69226.328633 | 3.834930 |
7 | 69100.961809 | 3.366484 |
8 | 68677.092580 | 3.142049 |
9 | 70310.064899 | 3.004924 |
10 | 69161.741045 | 3.430029 |
11 | 69529.290717 | 3.667713 |
12 | 69245.668661 | 2.619766 |
13 | 70136.308693 | 3.272416 |
14 | 69468.605945 | 3.432082 |
15 | 69273.411684 | 2.721993 |
16 | 68817.899103 | 3.787455 |
17 | 69093.916727 | 3.968573 |
18 | 69830.091293 | 3.373631 |
19 | 69367.608129 | 2.607412 |
20 | 69822.326151 | 2.880754 |
21 | 69448.155374 | 3.581689 |
22 | 68741.821011 | 3.319400 |
23 | 70304.317176 | 3.148235 |
24 | 68901.418517 | 3.199849 |
25 | 69241.775855 | 3.689510 |
26 | 70215.360799 | 3.628857 |
27 | 67274.806943 | 3.413520 |
28 | 69610.200698 | 4.057546 |
29 | 69010.756925 | 3.952230 |
30 | 70776.333333 | 3.633880 |
... | ... | ... |
131146 | 79570.000000 | 4.000000 |
131148 | 79570.000000 | 4.000000 |
131150 | 79570.000000 | 4.000000 |
131152 | 74937.000000 | 0.500000 |
131154 | 79570.000000 | 3.500000 |
131156 | 79570.000000 | 4.000000 |
131158 | 108819.000000 | 4.000000 |
131160 | 79570.000000 | 4.000000 |
131162 | 42229.000000 | 2.000000 |
131164 | 54560.000000 | 4.000000 |
131166 | 54560.000000 | 4.000000 |
131168 | 64060.000000 | 3.500000 |
131170 | 95841.000000 | 3.500000 |
131172 | 128309.000000 | 1.000000 |
131174 | 109286.000000 | 3.500000 |
131176 | 109286.000000 | 4.500000 |
131180 | 117144.000000 | 2.500000 |
131231 | 63046.000000 | 3.500000 |
131237 | 134701.000000 | 3.000000 |
131239 | 79570.000000 | 4.000000 |
131241 | 79570.000000 | 4.000000 |
131243 | 79570.000000 | 4.000000 |
131248 | 79570.000000 | 4.000000 |
131250 | 79570.000000 | 4.000000 |
131252 | 79570.000000 | 4.000000 |
131254 | 79570.000000 | 4.000000 |
131256 | 79570.000000 | 4.000000 |
131258 | 28906.000000 | 2.500000 |
131260 | 65409.000000 | 3.000000 |
131262 | 133047.000000 | 4.000000 |
26744 rows × 2 columns
movies.shape
(27278, 3)
movies.isnull().any()
movieId False title False genres False dtype: bool
ratings.shape
(20000263, 3)
ratings.isnull().any()
userId False movieId False rating False dtype: bool
tags.shape
(465564, 3)
tags.isnull().any()
userId False movieId False tag True dtype: bool
tags = tags.dropna()
tags.isnull().any()
userId False movieId False tag False dtype: bool
tags.shape
(465548, 3)
import matplotlib.pylab as plt
ratings.hist(column='rating', figsize=(15,10),bins=10)
plt.show()
tags['tag'].head()
0 Mark Waters 1 dark hero 2 dark hero 3 noir thriller 4 dark hero Name: tag, dtype: object
movies[['title','genres']].head()
title | genres | |
---|---|---|
0 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
1 | Jumanji (1995) | Adventure|Children|Fantasy |
2 | Grumpier Old Men (1995) | Comedy|Romance |
3 | Waiting to Exhale (1995) | Comedy|Drama|Romance |
4 | Father of the Bride Part II (1995) | Comedy |
ratings[-10:]
userId | movieId | rating | |
---|---|---|---|
20000253 | 138493 | 60816 | 4.5 |
20000254 | 138493 | 61160 | 4.0 |
20000255 | 138493 | 65682 | 4.5 |
20000256 | 138493 | 66762 | 4.5 |
20000257 | 138493 | 68319 | 4.5 |
20000258 | 138493 | 68954 | 4.5 |
20000259 | 138493 | 69526 | 4.5 |
20000260 | 138493 | 69644 | 3.0 |
20000261 | 138493 | 70286 | 5.0 |
20000262 | 138493 | 71619 | 2.5 |
ratings.tail(10)
userId | movieId | rating | |
---|---|---|---|
20000253 | 138493 | 60816 | 4.5 |
20000254 | 138493 | 61160 | 4.0 |
20000255 | 138493 | 65682 | 4.5 |
20000256 | 138493 | 66762 | 4.5 |
20000257 | 138493 | 68319 | 4.5 |
20000258 | 138493 | 68954 | 4.5 |
20000259 | 138493 | 69526 | 4.5 |
20000260 | 138493 | 69644 | 3.0 |
20000261 | 138493 | 70286 | 5.0 |
20000262 | 138493 | 71619 | 2.5 |
tag_counts = tags['tag'].value_counts()
tag_counts.head().plot(kind='bar', figsize=(12,8))
<matplotlib.axes._subplots.AxesSubplot at 0x26381863b38>
tag_counts.head(60).plot(kind='bar', figsize=(12,8))
<matplotlib.axes._subplots.AxesSubplot at 0x26380380710>
tag_counts[60:100].plot(kind='bar', figsize=(12,8))
<matplotlib.axes._subplots.AxesSubplot at 0x26381902780>
is_highly_rated = ratings['rating'] >= 4.0
ratings[is_highly_rated].head()
userId | movieId | rating | |
---|---|---|---|
6 | 1 | 151 | 4.0 |
7 | 1 | 223 | 4.0 |
8 | 1 | 253 | 4.0 |
9 | 1 | 260 | 4.0 |
10 | 1 | 293 | 4.0 |
is_animation = movies['genres'].str.contains('Animation')
movies[is_animation].head(15)
movieId | title | genres | |
---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
12 | 13 | Balto (1995) | Adventure|Animation|Children |
47 | 48 | Pocahontas (1995) | Animation|Children|Drama|Musical|Romance |
236 | 239 | Goofy Movie, A (1995) | Animation|Children|Comedy|Romance |
241 | 244 | Gumby: The Movie (1995) | Animation|Children |
310 | 313 | Swan Princess, The (1994) | Animation|Children |
360 | 364 | Lion King, The (1994) | Adventure|Animation|Children|Drama|Musical|IMAX |
388 | 392 | Secret Adventures of Tom Thumb, The (1993) | Adventure|Animation |
547 | 551 | Nightmare Before Christmas, The (1993) | Animation|Children|Fantasy|Musical |
553 | 558 | Pagemaster, The (1994) | Action|Adventure|Animation|Children|Fantasy |
582 | 588 | Aladdin (1992) | Adventure|Animation|Children|Comedy|Musical |
588 | 594 | Snow White and the Seven Dwarfs (1937) | Animation|Children|Drama|Fantasy|Musical |
589 | 595 | Beauty and the Beast (1991) | Animation|Children|Fantasy|Musical|Romance|IMAX |
590 | 596 | Pinocchio (1940) | Animation|Children|Fantasy|Musical |
604 | 610 | Heavy Metal (1981) | Action|Adventure|Animation|Horror|Sci-Fi |
ratings_count = ratings[['movieId','rating']].groupby('rating').count()
ratings_count
movieId | |
---|---|
rating | |
0.5 | 239125 |
1.0 | 680732 |
1.5 | 279252 |
2.0 | 1430997 |
2.5 | 883398 |
3.0 | 4291193 |
3.5 | 2200156 |
4.0 | 5561926 |
4.5 | 1534824 |
5.0 | 2898660 |
average_rating = ratings[['movieId','rating']].groupby('movieId').mean() # We are not interested in the user that voted for it
average_rating.head()
rating | |
---|---|
movieId | |
1 | 3.921240 |
2 | 3.211977 |
3 | 3.151040 |
4 | 2.861393 |
5 | 3.064592 |
sorted_average_rating=average_rating.sort_values(by="rating",ascending=False)
sorted_average_rating.head()
rating | |
---|---|
movieId | |
95517 | 5.0 |
105846 | 5.0 |
89133 | 5.0 |
105187 | 5.0 |
105191 | 5.0 |
Do not sort the list but intead ask where we have that the rating score is $5.0$
average_rating.loc[average_rating.rating==5.0].head()
rating | |
---|---|
movieId | |
26718 | 5.0 |
27914 | 5.0 |
32230 | 5.0 |
40404 | 5.0 |
54326 | 5.0 |
But since we do not understand to what this Id movie is related, we would like to see intead the name of the movie. To do that, we need to see in the movies
DataFrame
id_movie=average_rating.loc[average_rating.rating==5.0].index
movies.loc[movies.movieId.isin(id_movie)].head()
movieId | title | genres | |
---|---|---|---|
9007 | 26718 | Life On A String (Bian chang Bian Zou) (1991) | Adventure|Drama|Fantasy|Musical |
9561 | 27914 | Hijacking Catastrophe: 9/11, Fear & the Sellin... | Documentary |
9862 | 32230 | Snow Queen, The (Lumikuningatar) (1986) | Children|Fantasy |
10567 | 40404 | Al otro lado (2004) | Drama |
12015 | 54326 | Sierra, La (2005) | Documentary |
tags.head()
userId | movieId | tag | |
---|---|---|---|
0 | 18 | 4141 | Mark Waters |
1 | 65 | 208 | dark hero |
2 | 65 | 353 | dark hero |
3 | 65 | 521 | noir thriller |
4 | 65 | 592 | dark hero |
movies.head()
movieId | title | genres | |
---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
1 | 2 | Jumanji (1995) | Adventure|Children|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
t = pd.merge(movies,tags, on='movieId', how='inner')
t.head()
movieId | title | genres | userId | tag | |
---|---|---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | 1644 | Watched |
1 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | 1741 | computer animation |
2 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | 1741 | Disney animated feature |
3 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | 1741 | Pixar animation |
4 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | 1741 | Téa Leoni does not star in this movie |
Check More examples: http://pandas.pydata.org/pandas-docs/stable/merging.html
Combine aggreagation, merging, and filters to get useful analytics
avg_ratings = ratings.groupby('movieId', as_index=False).mean()
del avg_ratings['userId']
avg_ratings.head()
movieId | rating | |
---|---|---|
0 | 1 | 3.921240 |
1 | 2 | 3.211977 |
2 | 3 | 3.151040 |
3 | 4 | 2.861393 |
4 | 5 | 3.064592 |
box_office = pd.merge(movies,avg_ratings, on='movieId', how='inner')
box_office.tail()
movieId | title | genres | rating | |
---|---|---|---|---|
26739 | 131254 | Kein Bund für's Leben (2007) | Comedy | 4.0 |
26740 | 131256 | Feuer, Eis & Dosenbier (2002) | Comedy | 4.0 |
26741 | 131258 | The Pirates (2014) | Adventure | 2.5 |
26742 | 131260 | Rentun Ruusu (2001) | (no genres listed) | 3.0 |
26743 | 131262 | Innocence (2014) | Adventure|Fantasy|Horror | 4.0 |
is_highly_rated = box_office['rating'] >= 4.0
box_office[is_highly_rated].tail()
movieId | title | genres | rating | |
---|---|---|---|---|
26737 | 131250 | No More School (2000) | Comedy | 4.0 |
26738 | 131252 | Forklift Driver Klaus: The First Day on the Jo... | Comedy|Horror | 4.0 |
26739 | 131254 | Kein Bund für's Leben (2007) | Comedy | 4.0 |
26740 | 131256 | Feuer, Eis & Dosenbier (2002) | Comedy | 4.0 |
26743 | 131262 | Innocence (2014) | Adventure|Fantasy|Horror | 4.0 |
is_comedy = box_office['genres'].str.contains('Comedy')
box_office[is_comedy].head()
movieId | title | genres | rating | |
---|---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | 3.921240 |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance | 3.151040 |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance | 2.861393 |
4 | 5 | Father of the Bride Part II (1995) | Comedy | 3.064592 |
6 | 7 | Sabrina (1995) | Comedy|Romance | 3.366484 |
box_office[is_comedy & is_highly_rated].head()
movieId | title | genres | rating | |
---|---|---|---|---|
81 | 82 | Antonia's Line (Antonia) (1995) | Comedy|Drama | 4.004925 |
229 | 232 | Eat Drink Man Woman (Yin shi nan nu) (1994) | Comedy|Drama|Romance | 4.035610 |
293 | 296 | Pulp Fiction (1994) | Comedy|Crime|Drama|Thriller | 4.174231 |
352 | 356 | Forrest Gump (1994) | Comedy|Drama|Romance|War | 4.029000 |
602 | 608 | Fargo (1996) | Comedy|Crime|Drama|Thriller | 4.112359 |
movies.head()
movieId | title | genres | |
---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
1 | 2 | Jumanji (1995) | Adventure|Children|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
Split 'genres' into multiple columns
movie_genres = movies['genres'].str.split('|', expand=True)
movie_genres.head(10)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Adventure | Animation | Children | Comedy | Fantasy | None | None | None | None | None |
1 | Adventure | Children | Fantasy | None | None | None | None | None | None | None |
2 | Comedy | Romance | None | None | None | None | None | None | None | None |
3 | Comedy | Drama | Romance | None | None | None | None | None | None | None |
4 | Comedy | None | None | None | None | None | None | None | None | None |
5 | Action | Crime | Thriller | None | None | None | None | None | None | None |
6 | Comedy | Romance | None | None | None | None | None | None | None | None |
7 | Adventure | Children | None | None | None | None | None | None | None | None |
8 | Action | None | None | None | None | None | None | None | None | None |
9 | Action | Adventure | Thriller | None | None | None | None | None | None | None |
Add a new column for comedy genre flag
movie_genres['IsComedy'] = movies['genres'].str.contains('Comedy')
movie_genres.head()
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | IsComedy | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Adventure | Animation | Children | Comedy | Fantasy | None | None | None | None | None | True |
1 | Adventure | Children | Fantasy | None | None | None | None | None | None | None | False |
2 | Comedy | Romance | None | None | None | None | None | None | None | None | True |
3 | Comedy | Drama | Romance | None | None | None | None | None | None | None | True |
4 | Comedy | None | None | None | None | None | None | None | None | None | True |
Extract year from title e.g. (1995)
movies['year'] = movies['title'].str.extract('.*\((.*)\).*', expand=True)
More here
tags = pd.read_csv('./movielens/Large/tags.csv', sep=',')
tags.dtypes
userId int64 movieId int64 tag object timestamp int64 dtype: object
Unix time / POSIX time / epoch time records
time in seconds
since midnight Coordinated Universal Time (UTC) of January 1, 1970
tags.head(5)
userId | movieId | tag | timestamp | |
---|---|---|---|---|
0 | 18 | 4141 | Mark Waters | 1240597180 |
1 | 65 | 208 | dark hero | 1368150078 |
2 | 65 | 353 | dark hero | 1368150079 |
3 | 65 | 521 | noir thriller | 1368149983 |
4 | 65 | 592 | dark hero | 1368150078 |
tags['parsed_time'] = pd.to_datetime(tags['timestamp'], unit='s')
Data Type datetime64[ns] maps to either
tags['parsed_time'].dtype
dtype('<M8[ns]')
tags.head(2)
userId | movieId | tag | timestamp | parsed_time | |
---|---|---|---|---|---|
0 | 18 | 4141 | Mark Waters | 1240597180 | 2009-04-24 18:19:40 |
1 | 65 | 208 | dark hero | 1368150078 | 2013-05-10 01:41:18 |
Selecting rows based on timestamps
greater_than_t = tags['parsed_time'] > '2015-02-01'
selected_rows = tags[greater_than_t]
print(tags.shape, selected_rows.shape)
(465564, 5) (12130, 5)
Sorting the table using the timestamps
tags.sort_values(by='parsed_time', ascending=True)[:10]
userId | movieId | tag | timestamp | parsed_time | |
---|---|---|---|---|---|
333932 | 100371 | 2788 | monty python | 1135429210 | 2005-12-24 13:00:10 |
333927 | 100371 | 1732 | coen brothers | 1135429236 | 2005-12-24 13:00:36 |
333924 | 100371 | 1206 | stanley kubrick | 1135429248 | 2005-12-24 13:00:48 |
333923 | 100371 | 1193 | jack nicholson | 1135429371 | 2005-12-24 13:02:51 |
333939 | 100371 | 5004 | peter sellers | 1135429399 | 2005-12-24 13:03:19 |
333922 | 100371 | 47 | morgan freeman | 1135429412 | 2005-12-24 13:03:32 |
333921 | 100371 | 47 | brad pitt | 1135429412 | 2005-12-24 13:03:32 |
333936 | 100371 | 4011 | brad pitt | 1135429431 | 2005-12-24 13:03:51 |
333937 | 100371 | 4011 | guy ritchie | 1135429431 | 2005-12-24 13:03:51 |
333920 | 100371 | 32 | bruce willis | 1135429442 | 2005-12-24 13:04:02 |
average_rating = ratings[['movieId','rating']].groupby('movieId', as_index=False).mean()
average_rating.tail()
movieId | rating | |
---|---|---|
26739 | 131254 | 4.0 |
26740 | 131256 | 4.0 |
26741 | 131258 | 2.5 |
26742 | 131260 | 3.0 |
26743 | 131262 | 4.0 |
joined = pd.merge(movies,average_rating, on='movieId', how='inner')
joined.head()
movieId | title | genres | year | rating | |
---|---|---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | 1995 | 3.921240 |
1 | 2 | Jumanji (1995) | Adventure|Children|Fantasy | 1995 | 3.211977 |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance | 1995 | 3.151040 |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance | 1995 | 2.861393 |
4 | 5 | Father of the Bride Part II (1995) | Comedy | 1995 | 3.064592 |
joined.corr()
movieId | rating | |
---|---|---|
movieId | 1.000000 | -0.090369 |
rating | -0.090369 | 1.000000 |
yearly_average = joined[['year','rating']].groupby('year', as_index=False).mean()
yearly_average.head(10)
year | rating | |
---|---|---|
0 | 1891 | 3.000000 |
1 | 1893 | 3.375000 |
2 | 1894 | 3.071429 |
3 | 1895 | 3.125000 |
4 | 1896 | 3.183036 |
5 | 1898 | 3.850000 |
6 | 1899 | 3.625000 |
7 | 1900 | 3.166667 |
8 | 1901 | 5.000000 |
9 | 1902 | 3.738189 |
yearly_average.plot(x='year', y='rating', figsize=(12,8), grid=True)
plt.show()